I am feeling good about the course at the moment. In the beginning, I encountered some errors when connecting RStudio and GitHub. I was a bit frustrated at that time, but after I managed to solve them, I felt encouraged.
I have used RStudio a couple of times before, but without R Markdown. In this course, I expect to learn R Markdown and GitHub. Both of them are powerful tools, and I believe that I will use them in my research in the future.
I found this course on Sisu and was attracted by its name, 'Open Data Science'.
I find this book clear and well organized. I would recommend it to everyone who is starting to learn R.
https://github.com/keyanliu1/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 13 07:39:33 2022"
This work includes performing and interpreting regression analysis. Specifically, it includes plotting the data frame, fitting a linear regression model, interpreting the results, and producing diagnostic plots.
The data come from the 'International Survey of Approaches to Learning', made possible by Teachers' Academy funding for Kimmo Vehkalahti in 2013-2015. After some data wrangling, the final dataset I am working with contains 166 observations and 7 variables: gender, age, attitude, deep, stra, surf and points.
date()
## [1] "Tue Dec 13 07:39:33 2022"
#library
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# read the data into memory
lrn14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)
#check the dimension and structure of the data
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(lrn14[-1])
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
# add transparency and draw the plot
p2 <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2
#summary of the data
summary(lrn14)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Part 2.2 shows a graphical overview of the data. The distributions of the variables appear on the diagonal of the plot matrix, and the relationships among the variables on the off-diagonal. The R console table summarizes the variables.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ deep + stra + surf, data = lrn14)
#summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ deep + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## deep -0.7443 0.8662 -0.859 0.3915
## stra 0.9878 0.5962 1.657 0.0994 .
## surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
From the summary of the model, I find that stra has a positive statistical relationship with points (0.99), whereas surf has a negative relationship (-1.63). The effect of deep on points is not significant, because its p-value is greater than 0.1 (the p-value is the result of a null-hypothesis significance test of the true parameter being equal to zero). Therefore, I remove the deep variable, include attitude, and rerun the model.
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = lrn14)
#summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the new model, I find that attitude has a significant positive impact on points. However, the effect of surf is not significant, so I remove surf and rerun the model.
# create a regression model with multiple explanatory variables
my_model3 <- lm(points ~ attitude + stra , data = lrn14)
#summary of the model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
From the summary of the model above, the estimated coefficient of attitude is 3.4658. The interpretation is that, holding the other variables unchanged, a one-unit increase in attitude results in a 3.47-unit increase in points. This positive effect is significant, as the p-value of the coefficient is very low. By the same interpretation, the estimated coefficient of stra is 0.91: holding the other variables unchanged, a one-unit increase in stra results in a 0.91-unit increase in points. However, this effect is significant only at the 0.1 level, as its p-value is smaller than 0.1 but greater than 0.05. The adjusted R-squared is 0.195, meaning that 19.5% of the variability in the dependent variable is explained by the explanatory variables.
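As a quick sanity check of this interpretation, the fitted equation can be evaluated by hand. The sketch below copies the rounded coefficients from the summary above; the student profile (attitude = 4, stra = 3) is a made-up example:

```r
# coefficients copied (rounded) from the summary of my_model3
b0 <- 8.9729        # intercept
b_attitude <- 3.4658
b_stra <- 0.9137

# predicted exam points for a hypothetical student with attitude = 4, stra = 3
pred <- b0 + b_attitude * 4 + b_stra * 3
round(pred, 2)  # 25.58
```

The same number can of course be obtained with `predict(my_model3, newdata = ...)`; doing it by hand just makes the "one-unit increase" reading explicit.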
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model3)
Linear regression models have four main assumptions:

- a linear relationship between predictors and outcome;
- independence of residuals;
- normal distribution of residuals;
- equal variance of residuals.
From the diagnostic plots, I am able to observe the properties of the residuals. The Q-Q plot shows that the residuals match a normal distribution quite well. The Residuals vs Fitted plot shows that the residuals are randomly distributed on both sides of the fitted values, with roughly equal spread on each side. In this way, I assess the validity of the assumptions above based on the diagnostic plots.
This work includes performing and interpreting logistic regression analysis. Specifically, it includes plotting the data frame, fitting a logistic regression model, and interpreting the results.
This data set concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, collected from school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The two datasets were modeled under binary/five-level classification and regression tasks.
date()
## [1] "Tue Dec 13 07:39:38 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. In this exercise, I choose the famrel, studytime, age and freetime variables and study the relationship between these four variables and alcohol consumption. I assume that famrel and studytime have a negative impact on alcohol consumption, while age and freetime have a positive impact on it.
This subchapter uses bar plots to explore the distributions of my chosen variables and box plots to explore their relationships with alcohol consumption.
#distribution plots using bar plot
# initialize a plot of alcohol use
g_1 <- ggplot(data = alc, aes(x = alc_use))
g_1 + geom_bar()
g_2 <- ggplot(data = alc, aes(x = high_use))
g_2 + geom_bar()
g_3 <- ggplot(data = alc, aes(x = famrel))
g_3 + geom_bar()
g_4 <- ggplot(data = alc, aes(x = studytime))
g_4 + geom_bar()
g_5 <- ggplot(data = alc, aes(x = age))
g_5 + geom_bar()
g_6 <- ggplot(data = alc, aes(x = freetime))
g_6 + geom_bar()
# initialize a plot of high_use and famrel
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationship")
# studytime and alcohol consumption
g2 <- ggplot(alc, aes(x = high_use, y = studytime))
g2 + geom_boxplot() + ylab("study time")
# age and alcohol consumption
g3 <- ggplot(alc, aes(x = high_use, y = age))
g3 + geom_boxplot() + ylab("age")
# freetime and alcohol consumption
g4 <- ggplot(alc, aes(x = high_use, y = freetime))
g4 + geom_boxplot() + ylab("freetime")
Part 3.3 shows a graphical overview of the variables. The distributions of the variables can be found in the bar plots. I find that the distributions of alcohol use, high_use, studytime and age are left-skewed, while the distributions of famrel and freetime are right-skewed. From the box plots, it seems that famrel and studytime are negatively correlated with alcohol use, whereas age and freetime seem to have no effect on it.
This part runs the logistic regression with the variables selected above.
# find the model with glm()
m <- glm(high_use ~ famrel + studytime + age + freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + studytime + age + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7542 -0.8553 -0.6349 1.1722 2.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2910 1.8099 -1.818 0.069019 .
## famrel -0.3542 0.1315 -2.693 0.007086 **
## studytime -0.5725 0.1576 -3.632 0.000281 ***
## age 0.2221 0.1023 2.170 0.029983 *
## freetime 0.3793 0.1264 3.000 0.002701 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 415.07 on 365 degrees of freedom
## AIC: 425.07
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03721707 -6.88776678 0.23054128
## famrel 0.70173583 -0.61460703 -0.09722191
## studytime 0.56414042 -0.89131071 -0.27196009
## age 1.24866753 0.02362678 0.42612657
## freetime 1.46128242 0.13517480 0.63202911
From the table, I observe that the estimated coefficients of famrel and studytime are -0.35 and -0.57 respectively, and that they are statistically significant judging by their p-values. The results indicate that famrel and studytime have a negative relationship with alcohol consumption, which supports my earlier assumption. On the other hand, the estimated coefficients of age and freetime are positive (0.22 and 0.38 respectively). The p-value of the age coefficient is 0.03, which indicates that the coefficient is significant at the 0.05 level but not at the 0.01 level. In addition, the odds ratios, which are the exponentiated estimated coefficients, and their confidence intervals are given in the table. The exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs. no change) in the corresponding explanatory variable.
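To make the odds-ratio column concrete: each OR is simply the exponentiated coefficient. A small check, using the famrel coefficient copied from the summary above:

```r
# famrel coefficient from the model summary above
b_famrel <- -0.3542
exp(b_famrel)  # about 0.702, matching the OR column
# interpretation: a one-unit improvement in family relations multiplies the
# odds of high alcohol use by roughly 0.70, holding the other variables fixed
```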
## prediction
## high_use FALSE TRUE
## FALSE 238 21
## TRUE 92 19
This part shows the predictive power of the model; a 2x2 cross tabulation of the true high_use values against the model's predictions is provided above. The general predictive performance is good. However, one type of error is prominent: when high_use is true, the model usually predicts it wrongly (92 of 111 cases). The training error is 0.305, which is quite high, so a better model may be needed. Still, this model performs better than a simple guessing strategy.
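The training error quoted above can be verified directly from the cross tabulation: the wrong predictions are the two off-diagonal cells.

```r
# off-diagonal cells of the 2x2 table above: 21 false positives, 92 false negatives
wrong <- 21 + 92
total <- 238 + 21 + 92 + 19
round(wrong / total, 3)  # 0.305
```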
In this part, cross-validation is performed on my model. (In the code below, K is set to the number of observations, i.e. leave-one-out cross-validation.)
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the proportion of wrong predictions in the (training) data
# when always guessing FALSE (prob = 0)
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.3189189
The proportion of wrong predictions in the cross-validation is 0.32, which is larger than the 0.26 achieved by the model in Exercise 3. Unfortunately, my model does not have better test-set performance than that model.
This work includes performing and interpreting clustering and classification. Specifically, it includes plotting the data frame, fitting linear discriminant analysis, prediction, and k-means clustering.
Load the Boston data from the MASS package. The data is titled Housing Values in Suburbs of Boston and it contains 506 rows and 14 columns.
date()
## [1] "Tue Dec 13 07:39:40 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This subchapter uses scatter plot matrices to explore the distributions of the Boston variables and the relationships among them.
# draw a scatter plot matrix of the first seven variables in Boston
pairs(Boston[1:7])
# draw the plot
p2 <- ggpairs(Boston[1:7], mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p2
p3 <- ggpairs(Boston[8:14], mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p3
# calculate the correlation matrix
cor_matrix <- cor(Boston)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
Part 4.2 shows a graphical overview of the variables. The distributions of the variables can be found in the second and third plots. The last plot is the correlation plot, which clearly shows the correlations among the variables.
This part makes some modifications to the data. Specifically, I scale the data, create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate) using the quantiles as the break points, drop the old crime rate variable from the dataset, and divide the dataset into train and test sets so that 80% of the data belongs to the train set.
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
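The scale() function standardizes each column by subtracting the column mean and dividing by the column standard deviation, which is why the scaled crim has mean 0 in the summary above. A minimal illustration with a toy vector:

```r
x <- c(1, 2, 3, 4)
scaled <- as.numeric(scale(x))
# scale() gives the same result as the explicit z-score formula
all.equal(scaled, (x - mean(x)) / sd(x))  # TRUE
round(mean(scaled), 10)                   # 0
```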
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high", "high"), include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
n
## [1] 506
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
In this part, a linear discriminant analysis is performed on the train dataset. The categorical crime rate is used as the target variable, and all the other variables in the dataset as predictor variables. This part also draws the LDA (bi)plot.
# linear discriminant analysis
lda.fit <- lda(crime~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2722772 0.2351485 0.2599010 0.2326733
##
## Group means:
## zn indus chas nox rm age
## low 0.96726922 -0.9059641 -0.12916179 -0.8962866 0.44964637 -0.9081391
## med_low -0.09502722 -0.3265046 -0.06511327 -0.5662169 -0.14101064 -0.3301678
## med_high -0.37943482 0.1607140 0.25261763 0.3613021 0.08912785 0.4275011
## high -0.48724019 1.0172896 -0.10479289 1.0887916 -0.30332842 0.7844916
## dis rad tax ptratio black lstat
## low 0.8873301 -0.6864020 -0.7340216 -0.4648813 0.3812548 -0.770953572
## med_low 0.3223853 -0.5502894 -0.4986894 -0.1574159 0.3140497 -0.136661104
## med_high -0.3586944 -0.4142004 -0.3206812 -0.2429668 0.1260048 -0.009231174
## high -0.8250980 1.6363892 1.5128120 0.7787521 -0.7517224 0.744945651
## medv
## low 0.55065881
## med_low 0.00410129
## med_high 0.15849227
## high -0.66439021
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11252334 0.685740948 -0.90206618
## indus 0.05088735 -0.193724231 0.17391560
## chas -0.09590020 -0.134892430 -0.01940927
## nox 0.40232522 -0.628235291 -1.45818214
## rm -0.08127539 -0.077980346 -0.22073613
## age 0.28846191 -0.407828209 -0.16379426
## dis -0.01610702 -0.236202348 -0.04800452
## rad 3.18684931 1.047248677 0.10558259
## tax -0.03109747 -0.171140007 0.50603904
## ptratio 0.11399966 0.058745130 -0.50053621
## black -0.12463229 -0.006794907 0.11354879
## lstat 0.20537017 -0.212244868 0.46759056
## medv 0.16987237 -0.332253220 -0.22184497
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9451 0.0419 0.0129
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 4,col = classes)
lda.arrows(lda.fit, myscale = 3)
In this chapter, I predict the crime classes with the test data.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 6 0 0
## med_low 8 15 8 0
## med_high 0 4 16 1
## high 0 0 0 33
The cross tabulation indicates quite good prediction results based on LDA. One exception is the med_low class: when the correct class is med_low, the model misclassifies it about half the time (16 of 31 cases).
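The overall accuracy implied by the cross tabulation is the diagonal count divided by the total. The cell values below are copied from the table above; they will vary between runs because the train/test split is random:

```r
# correct predictions sit on the diagonal of the table above
correct <- 11 + 15 + 16 + 33
total <- 11 + 6 + 8 + 15 + 8 + 4 + 16 + 1 + 33
round(correct / total, 3)  # 0.735
```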
This part calculates the distances between the observations, runs the k-means algorithm on the dataset, and investigates the optimal number of clusters.
# scale the data
Boston <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.5760 4.9401 4.9913 6.3033 12.8856
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2645 8.9648 13.2765 14.1297 18.5263 46.5452
set.seed(13)
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
The optimal number of clusters is 2. This is determined by observing that the total WCSS drops radically when k is 2. In the pairs plot, I notice that the tax variable seems to affect the clustering results the most.
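The "radical drop" reading of the WCSS curve can be made numeric: the elbow is where the relative decrease in total WCSS flattens out. The sketch below uses simulated two-cluster data (not the Boston data) to keep it self-contained:

```r
set.seed(42)
# two well-separated 2-D clusters of 50 points each
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
twcss <- sapply(1:6, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)
# relative drop between consecutive k: large going to k = 2, small afterwards
drops <- -diff(twcss) / head(twcss, -1)
round(drops, 2)
```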
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color = ~train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color= ~km$cluster[1:404])
This work includes performing and interpreting dimensionality reduction techniques. Specifically, it includes plotting the data frame and performing principal component analysis (PCA) and multiple correspondence analysis (MCA).
Load the human data from the website. The data is titled 'human' and it contains 155 observations and 8 variables.
date()
## [1] "Tue Dec 13 07:39:51 2022"
#library
library(GGally)
library(ggplot2)
library(dplyr)
library(tidyverse)
# read the data into memory
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt",
sep =",", header = T)
# explore the dataset
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# Access GGally
library(GGally)
# visualize the 'human_' variables
ggpairs(human[1:4])
ggpairs(human[5:8])
# Access corrplot
library(corrplot)
# compute the correlation matrix and visualize it with corrplot
corr<-cor(human)
corrplot(corr)
Part 5.1 shows a graphical overview of the variables. The distributions of the variables can be found in the first two plots, from which we conclude that almost all variables have skewed distributions. The last plot is the correlation plot, which clearly shows the correlations among the variables.
This part performs principal component analysis (PCA) on the raw (non-standardized) human data.
# perform PCA on the raw (non-standardized) data
pca_human <- prcomp(human)
# create and print out a summary of pca_human
s <- summary(pca_human)
biplot(pca_human, choices = 1:2,cex = c(1, 1),col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
From the PCA result, the first principal component captures (after rounding) all of the variability, while the remaining components capture essentially none.
This part performs principal component analysis (PCA) on the standardized human data.
# standardize the variables
human_ <- scale(human)
# perform PCA on the standardized data
pca_human_ <- prcomp(human_)
# create and print out a summary of pca_human_
s <- summary(pca_human_)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
pc_lab
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)" "PC4 (7.6%)" "PC5 (5.5%)"
## [6] "PC6 (3.6%)" "PC7 (2.6%)" "PC8 (1.3%)"
# draw a biplot
biplot(pca_human_, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The PCA on the standardized data is completely different from the PCA on the raw data. In part 5.2, where I performed PCA on the raw data, the first PC accounted for 100 percent of the variability, and the variable GNI alone accounted for almost all of that component. This is because the values of GNI are on a much larger scale than the other variables, so its variance dominates the analysis. After standardization the results become sensible: the first component captures 53.6 percent of the total variability and the second captures 16.2 percent. From this biplot I can also read the correlations between the variables and between the variables and the PCs, by noting that:

- The angle between two arrows can be interpreted as the correlation between the variables.
- The angle between an arrow and a PC axis can be interpreted as the correlation between the variable and that PC.
- The length of an arrow is proportional to the standard deviation of the variable.
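The effect of scaling can be demonstrated on a small synthetic dataset (a toy illustration of my own, not part of the course data):

```r
# Toy illustration: a variable on a much larger scale dominates PC1
# unless the data are standardized first.
set.seed(1)
toy <- data.frame(small = rnorm(100), big = rnorm(100, sd = 1000))

# proportion of variance per PC, raw vs standardized
round(summary(prcomp(toy))$importance[2, ], 3)        # 'big' dominates: PC1 ~ 1
round(summary(prcomp(scale(toy)))$importance[2, ], 3) # variance split ~ evenly
```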
In this part, I interpret the PCs. From the biplot and the interpretation rules described above, the variables Labor.FM and Parli.F mainly contribute to PC2, while the other variables mainly contribute to PC1.
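The same conclusion can be checked numerically from the rotation (loadings) matrix of the PCA object; assuming pca_human_ from the chunk above, something like:

```r
# loadings of each original variable on the first two PCs; large absolute
# values indicate a strong contribution to that component
round(pca_human_$rotation[, 1:2], 2)
```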
In this part, I explore the tea data and perform Multiple Correspondence Analysis (MCA) on it. The tea data come from the FactoMineR package and were collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and how they perceive the product (12 questions).
# tea_time is available
library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")+ geom_bar()+ theme(axis.text.x = element_text(angle=20, hjust = 1, size = 8))
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",habillage = "quali")
From the MCA factor map, I can see clearly how the categories contribute to each dimension. For example, the category green mainly contributes to dimension 2, while tea bag contributes mostly to dimension 1.
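The squared correlation ratios printed in the summary can also be extracted directly from the MCA object; assuming mca from the chunk above:

```r
# eta2 of each variable on the first MCA dimensions: values close to 1 mean
# the variable is strongly linked to that dimension
round(mca$var$eta2, 2)
```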
This work includes performing and interpreting analyses of longitudinal data.
library(tidyr); library(dplyr); library(ggplot2)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470,…
## $ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465,…
## $ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475,…
## $ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485,…
## $ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487,…
## $ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493,…
## $ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493,…
## $ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504,…
## $ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507,…
## $ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518,…
## $ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525,…
head(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
tail(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 11 11 2 445 445 450 452 455 455 451 450 462 466 472
## 12 12 2 555 560 565 580 590 597 595 595 612 618 628
## 13 13 3 470 465 475 485 487 493 493 504 507 518 525
## 14 14 3 535 525 530 533 535 540 525 530 543 544 559
## 15 15 3 520 525 530 540 543 546 538 544 553 555 548
## 16 16 3 510 510 520 515 530 538 535 542 550 553 569
# Convert data to long form:
RATSL <- pivot_longer(RATS, cols=-c(ID,Group), names_to = "WD",values_to = "Weight") %>% mutate(Time = as.integer(substr(WD,3,4))) %>% arrange(Time)
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
head(RATSL)
## # A tibble: 6 × 5
## ID Group WD Weight Time
## <fct> <fct> <chr> <int> <int>
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
tail(RATSL)
## # A tibble: 6 × 5
## ID Group WD Weight Time
## <fct> <fct> <chr> <int> <int>
## 1 11 2 WD64 472 64
## 2 12 2 WD64 628 64
## 3 13 3 WD64 525 64
## 4 14 3 WD64 559 64
## 5 15 3 WD64 548 64
## 6 16 3 WD64 569 64
p1 <- ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
p6
# Standardise the scores:
RATSL <- RATSL %>%
group_by(Time) %>%
mutate( stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
p1 <- ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized weight")
p6
# Number of subjects (note: the groups actually contain 8, 4 and 4 rats, so
# using n <- 58 here understates the standard errors):
n <- 58
# Make a summary data:
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean=mean(Weight), se=sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se <dbl> 1.998691, 1.719205, 1.506860, 1.785874, 1.451918, 1.547284, 1.43…
p1 <- ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(1,2,3))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(1,2,3))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = c(0.8,0.8))
p7 <- p6 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p7
p1 <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = c(0.8,0.8))
p5 <- p4 + scale_x_discrete(name = "Time")
# Black & White version:
#p6 <- p5 + scale_fill_grey(start = 0.5, end = 1)
p5
# Make a summary dataset of the post-baseline measurements (Time > 1): mean weight per rat
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
p1 <- ggplot(RATSL8S, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight), Times 8-64")
p5
# Remove the outlier (note: the largest post-baseline mean is 594.0, so the
# threshold of 600 removes nothing -- all 16 rats remain, as the glimpse
# below confirms; a cutoff below 594 would be needed to drop rat 12):
RATSL8S1 <- RATSL8S %>%
filter(mean < 600)
glimpse(RATSL8S1)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
p1 <- ggplot(RATSL8S1, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight), Times 8-64")
p5
# Without the outlier, apply Student's t-test, two-sided:
#t.test(mean ~ Group, data = RATSL8S1, var.equal = TRUE)
# Add the baseline from the original data as a new variable to the summary data:
baseline <- RATS$WD1
RATSL8S2 <- RATSL8S %>%
mutate(baseline)
# Fit the ANCOVA model and see the results:
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 1.56e-07 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 1.989e-13
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(tidyr); library(dplyr); library(ggplot2)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week0 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week1 <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, …
## $ week2 <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, …
## $ week3 <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, …
## $ week4 <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, …
## $ week5 <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, …
## $ week6 <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, …
## $ week7 <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, …
## $ week8 <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, …
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- pivot_longer(BPRS, cols=-c(treatment,subject),names_to = "weeks",values_to = "bprs") %>% arrange(weeks)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
head(BPRSL); tail(BPRSL)
## # A tibble: 6 × 5
## treatment subject weeks bprs week
## <fct> <fct> <chr> <int> <int>
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
## # A tibble: 6 × 5
## treatment subject weeks bprs week
## <fct> <fct> <chr> <int> <int>
## 1 2 15 week8 37 8
## 2 2 16 week8 22 8
## 3 2 17 week8 43 8
## 4 2 18 week8 30 8
## 5 2 19 week8 21 8
## 6 2 20 week8 23 8
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_text(aes(label = treatment))
p3 <- p2 + scale_x_continuous(name = "week", breaks = seq(0, 8, 1))
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
# treatment2 in the summary output is the dummy variable for treatment group 2
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_line()
p3 <- p2 + scale_x_continuous(name = "week", breaks = seq(0, 8, 1))
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "top")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6
pairs(BPRS[, 3:11], cex = 0.7)
library("lme4")
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted <- fitted(BPRS_ref2)
BPRSL <- BPRSL %>% mutate(Fitted)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_line()
p3 <- p2 + scale_x_continuous(name = "week", breaks = seq(0, 8, 1))
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "right") # "none" in the book
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7
p1 <- ggplot(BPRSL, aes(x = week, y = Fitted, group = subject))
p2 <- p1 + geom_line()
p3 <- p2 + scale_x_continuous(name = "week", breaks = seq(0, 8, 1))
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7
graph1; graph2
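To see how the random intercepts and slopes vary across subjects in the final model, the estimated random effects can be inspected (assuming BPRS_ref2 from the chunk above):

```r
library(lme4)
# per-subject deviations from the fixed-effect intercept and week slope
head(ranef(BPRS_ref2)$subject)
```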
(more chapters to be added similarly as we proceed with the course!)